library(tidyverse)
HistData::Snow.deaths %>% head() case x y
1 1 13.588010 11.095600
2 2 9.878124 12.559180
3 3 14.653980 10.180440
4 4 15.220570 9.993003
5 5 13.162650 12.963190
6 6 13.806170 8.889046
Spatial epidemiology is the description and analysis of geographically indexed health data with respect to demographic, environmental, behavioural, socio-economic, genetic, and infectious risk factors.
Some of the definitions of spatial epidemiology are:
The description of spatial patterns of disease incidence and mortality1
Spatial Epidemiology concerns the analysis of the spatial/geographical distribution of the incidence of disease2
Health phenomena often involve spatial relationships among individuals and risk factors.
This section provides an introduction to geographic information systems (GIS) in R. The field of GIS in healthcare has become extremely useful in providing a fresh outlook to public health. GIS provides an opportunity to enable overlaying data with its spatial representation to support better planning and decision-making in healthcare. The convergence of many new sub-disciplines such as medical geography, public health, health informatics and data science help us better understand the similarities and differences in population health across the world. Some of the applications of GIS in healthcare include disease surveillance, environmental health, infectious diseases such as mathematical modelling and agent based modelling, and even medical imagining. While traditional uses of GIS in healthcare still are relevant, newer methods and advancing technology would be monumental for public health research.
It is now being realized that the healthcare industry can benefit tremendously from the potential of GIS. Innovative ways are being developed to harness the power of GIS through data integration pipelines and spatial visualization. Both the public and private sectors are adopting GIS to provide value addition across different sectors of healthcare - from public health departments and public health policy and research organizations to hospitals, medical centres, and health insurance organizations. With this background, let us dive right in!
It must be evident by now that geographic information systems (GIS) try to answer the question of WHERE:
GIS enabled systems has the potential to offer crucial insights into such questions? However, it is also important that the right questions as being asked. In order to understand better on how to ask the right questions, we may have to revisit the history of GIS in Public Health.
Early examples using maps as tools to better understand disease and death have been applications such as medical mapping and disease topography. From early 17th Century, maps have not simply been used to illustrate a situation but also to prove an argument. Pioneering works in this regard have helped disprove the Theory of Miasma and identified that water as the cause of disease rather than air. Soon by the 18th Century medical maps saw many applications in public health such as plague, yellow fever, and cholera.
In 1854, an English physician, John Snow, provided the classic example of how GIS mapping can be used in epidemiological research. He identified the water source responsible for an outbreak of cholera in London by mapping the locations of those afflicted.
By plotting the number and location of fatalities using stacks of bars on a map, Snow was able to perform a task that is now easily taken for granted: he visualized a spatial distribution. Looking at the results, the pattern on the map seems unmistakable. The map appears to support Snow’s claims that cholera is a waterborne disease and that the pump on Broad Street is the source of the outbreak.
Despite its hand-drawn, back-of-the-envelope appearance, Snow writes:
“The inner dotted line on the map shows the various points which have been found by careful measurement to be at an equal distance by the nearest road from the pump in Broad Street and the surrounding pumps …”
With these words let us recreate John Snows famous map of Cholera in R.
But first, some best practices:
Geography and Geospatial Science Working Group (GeoSWG) has recognised the need for best practices in cartography. With this purpose Centre for Disease Control (CDC, Atlanta) has come up with the Cartographic Guidelines for Public Health. Adhering to these guidelines, help the researchers develop high-quality, consistent map products to promote the mission of public health.
Some of the important aspects of these guidelines are:
Projections transform the curved, three-dimensional surface of the earth into a flat, two-dimensional plane. All map projections have distortions (distance, area, direction, and/or shape). The choice of projection depends on the intended use of the map. An equal-area map projection is a good selection for portraying geographic data distributions and is suitable for most other maps. However, if the map is attempting to show distance from patients to providers, for example, then an equidistant projection is appropriate because it preserves distance.
We sometimes refer to coordinate systems (or grid systems) and datums in context with map projections. With the help of coordinate reference systems (CRS) every place on the earth can be specified by a set of three numbers, called coordinates. Various coordinate systems and datums are used throughout the world. Most mapping starts with a projected map and a coordinate system overlay, which enables locational referencing. Datums, based upon different ellipsoids (idealized versions of the shape of the earth), define the origin and orientation of latitude and longitude lines. The most recently developed and widely used datum is WGS 1984. Global positioning system (GPS) data are often collected in the WGS 1984 datum.
It is important to note that this CRS represented in can be represented in many ways. For example:
PROJ.4 format as +proj=longlat +datum=WGS84 +no_defs +type=crs, and in Well Known Text (WKT) format as
GEOGCS[“WGS 84”, DATUM[“WGS_1984”, SPHEROID[“WGS 84”,6378137,298.257223563]], PRIMEM[“Greenwich”,0], UNIT[“degree”,0.0174532925199433, AUTHORITY[“EPSG”,“9122”]], AUTHORITY[“EPSG”,“4326”]]
There are over 60 different file formats in GIS. They can be broadly classified into vector file formats and raster file formats.
Vector graphics are comprised of vertices and paths. The three basic symbol types in vector data are points, lines and polygons. Among vector data, the commonly used file formats are the ESRI Shape file extensions (.shp, .dbf, .shx), Geographic JavaScript Object Notation (.geojson, .json) and Google Keyhole Markup Language (.kml, .kmz). Please note that all files of the ESRI Shapefiles are necessary for the file to work properly.
On the contrary, raster data is made up of pixels, also referred to as grid cells). They are usually square but can also be hexagons or other shapes. Rasters have pixels that are associated with a value (continuous) or class (discrete). One of the most widely used raster formats are the GeoTiff extensions (.tif, .tiff).
With this background, we can now begin to recreate John Snow’s map.
sf packageThere are several packages that are available on CRAN that can perform spatial analytical tasks and operations. However, not all of them are having continued support from the original authors and are turning obsolete.
The figure below shows the different spatial packages on CRAN and their popularity, its is clear that the sf package has surpassed all other packages in terms of popularity and user downloads.
The sf package provides simple features access for R. Simple Features is a set of standards that specify a common storage and access model of geographic feature made of mostly two-dimensional geometries (point, line, polygon, multi-point, multi-line, etc.) used by geographic information systems. It is formalized by both the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO).
A feature is thought of as a thing, or an object in the real world, such as a building or a tree. As is the case with objects, they often consist of other objects. A set of features can form a single feature. A tree in a forest can be a feature, a forest can be a feature, a city can be a feature. An image pixel from a satellite can be a feature, or a complete MRI scan image can be a feature too.
Today we shall be working with one of the in built datasets in R. There are other ways to import the different spatial files in R with the sf package. You can look into the documentation of the sf package for additional help (https://r-spatial.github.io/sf/). The datasets from the cholera outbreak in 1854 are available from the HistData package. The data sets used in this chapter are:
Snow.deathsSnow.streetsSnow.pumpsThe datasets are available as dataframe objects. We need to manipulate them into sf objects as it is the best practice (and also very intuitive and easy to work with). Let us look at the head of dataset Snow.deaths to understand its structure.
case x y
1 1 13.588010 11.095600
2 2 9.878124 12.559180
3 3 14.653980 10.180440
4 4 15.220570 9.993003
5 5 13.162650 12.963190
6 6 13.806170 8.889046
We can now manipulate the spatial file into an sf object.
The above code takes the dataset Snow.deaths from the HistData package and converts it into an sf object with additional arguments coords (which specifies the column representing the latitude and longitude respectively) and crs which takes the European Petroleum Survey Group (EPSG) code for the coordinate reference system (https://epsg.io). We are also creating it as a new object names snow_pumps for future reference.
We can check the class of the newly created object.
Lets print the object in the console and have a look.
Simple feature collection with 578 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 8.280715 ymin: 6.090047 xmax: 17.93893 ymax: 16.97276
Geodetic CRS: WGS 84
# A tibble: 578 × 2
case geometry
* <int> <POINT [°]>
1 1 (13.58801 11.0956)
2 2 (9.878124 12.55918)
3 3 (14.65398 10.18044)
4 4 (15.22057 9.993003)
5 5 (13.16265 12.96319)
6 6 (13.80617 8.889046)
7 7 (13.10214 10.56081)
8 8 (11.00403 11.86713)
9 9 (15.15475 11.70451)
10 10 (11.12639 9.643859)
# ℹ 568 more rows
Congratulations! You’ve just created your first sf object.
Now, lets plot the sf object using the ggplot2’s geom_sf wrapper.
Now, Lets load and look at the streets of Soho, London. However, the streets are represented as points which might be tricky to manipulate. However, thanks to the mighty sf package, we can perform this task in an elegant way.
snow_streets <- HistData::Snow.streets %>%
st_as_sf(., coords = c('x', 'y'), crs = 4326) %>%
group_by(street) %>%
summarize(n = mean(n)) %>%
st_cast('LINESTRING')
snow_streets %>%
ggplot() +
geom_sf()Now lets overlay the cholera deaths over the streets. We can optionally color the deaths in red and set transparency at 20% for better visualisation.
ggplot() +
geom_sf(data = snow_streets) +
geom_sf(data = snow_deaths, color = 'red', alpha = 0.2, size = 3)Lets add the pumps now… Add nifty little labels using the function geom_sf_label.
snow_pumps <- HistData::Snow.pumps %>%
st_as_sf(., coords = c('x', 'y'), crs = 4326)
ggplot() +
geom_sf(data = snow_streets) +
geom_sf(data = snow_deaths, color = 'red',
alpha = 0.2, size = 3) +
geom_sf(data = snow_pumps , shape = 22,
size = 4, fill = 'blue', color = 'blue') +
geom_sf_label(data = snow_pumps, aes(label = label),
nudge_x = 0.025, nudge_y = -0.5)Congratulations on making your first map in R.
Exercise